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Abstract.  The  most  popular  method  for  the  solution  of  linear  systems  of  equations  with  Toeplitz 
coefficient  matrix  on  a  single  processor  is  Levinson’s  algorithm,  whose  intermediate  vectors  form  the 
Cholesky  factor  of  the  inverse  of  the  Toeplitz  matrix.  However,  Levinson’s  method  is  not  amenable 
to  efficient  parallel  implementation.  In  contrast,  use  of  the  Schur  algorithm,  whose  intermediate 
vectors  form  the  Cholesky  factor  of  the  Toeplitz  matrix  proper,  makes  it  possible  to  perform  the 
entire  solution  procedure  on  one  processor  array  in  time  linear  in  the  ojder  of  the  matrix. 

By  means  of  the  Levinson  recursions  we  will  show  that  all  three  phases  of  the  Toeplitz  system 
solution  process:  factorisation,  forward  elimination  and  backsubstitution,  can  be  based  on  Schur 
recursions.  This  increased  exploitation  of  the  Toeplitz  structure  then  leads  to  more  efficient  parallel 
implementations  on  systolic  arrays. 

To  appear  in:  Proceedings  of  the  IMA  Workshop  on  Numerical  Algorithms  for  Modern  Parallel 
Computer  Architectures ,  M.H.  Schultz,  ed.,  Springer  Verlag,  1987. 
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Introduction 


The  turn  of  this  paper  is  to  discuss  parallel  methods  for  the  solution  of  linear  systems  of 
equations 

Tnx  =  f 

whose  coefficient  matrices  Tn  are  dense  symmetric  positive-definite  Toeplitz  matrices. 

A  symmetric  Toeplitz  matrix  Tn  =  [tu)o<k,l<n  of  order  n  -I- 1  is  a  matrix  whose  elements  are 
constant  along  each  diagonal,  tu  =  The  solution  of  a  general,  n  x  n  system  of  equations  by 

a  direct  method  requires  0(ns)  operations.  Since  a  n  x  n  Toeplitz  matrix  is  characterised  by  O(n) 
rather  than  0(n2)  parameters  efficient  algorithms  for  the  solution  of  Toeplitz  systems  exhibit  an 
operation  count  that  is  considerably  smaller:  the  classical  Levinson  and  Schur  algorithms  require 
0(n2)  operations  [1,  14,  15,  18]  while  the  doubling  algorithms  require  0{n log2  n),  cf.  the  early 
references  [2,9].  A  thorough  treatment  of  Toeplitz  matrices  is  given  in  [10, 11],  and  a  brief  summary 
can  be  found  in  [17].  Numerical  aspects  of  algorithms  for  Toeplitz  matrices  are  reviewed  in  [5]. 

Development  of  parallel  implementations  for  the  solution  of  dense  Toeplitz  systems  was  moti¬ 
vated  by  the  need  to  execute  certain  signal  processing  tasks  in  real-time.  The  preferred  architectures 
are  systolic  arrays,  special-purpose  devices  built  with  Very  Large  Scale  Integrated  (VLSI)  circuit 
technology  [13].  Systolic  arrays  are  homogeneous  networks  of  tightly  coupled,  highly  synchronised, 
simple  processors  that  essentially  operate  in  SIMD  (Single  Instruction  Multiple  Data  stream)  mode. 
Due  to  the  repetitiveness  of  the  computations  and  the  regularity  of  the  data  dependencies  systolic 
implementations  can  be  described  by  means  of  linear  transformations:  the  processor  in  which  a 
quantity  r<  j  is  computed  as  well  as  the  time  of  its  computation  is  expressed  as  a  linear  function  in 
the  indices  »  and  j  [7,  8].  To  keep  the  approach  simple  and  intuitive,  implementation  details  will 
be  omitted  in  this  paper,  they  can  be  found  in  [7,  8]. 

Three  classes  of  systolic  arrays  will  be  presented  whose  efficiency  improves  with  increased 
exploitation  of  the  Toeplitz  structure  in  various  phases  of  the  solution  process. 

The  classical  method  of  choice  for  solving  a  n  x  n  symmetric  positive-definite  Toeplitz  system 
on  a  single  processor  is  the  Levinson  algorithm  [14].  The  intermediate  vectors  generated  by  the 
Levinson  algorithm  form  the  Cholesky  factor  of  the  inverse  of  the  Toeplitz  matrix.  Due  to  a 
sequence  of  n  inner  products,  however,  the  lower  bound  of  the  parallel  solution  time  on  n  processors 
is  0(n  log  n). 

The  second  classical  method  is  the  Schur  algorithm  [1,  15,  18];  its  intermediate  vectors  form 
the  Cholesky  factor  of  the  Toeplitz  matrix.  Although  its  operation  count  is  fifty  percent  higher 


than  that  of  Levinson’s  method,  it  is  more  amenable  to  parallel  implementation:  an  array  of  0(n) 
processors  can  determine  the  Cholesky  factor  of  an  order-n  Toeplitz  matrix  in  time  O(n)  [3,  4,  6, 
7,  8,  12,  16]. 

The  solution  to  the  Toeplitz  system  can  be  found  by  performing  forward  elimination  with  the 
transpose  of  the  Cholesky  factor  and  subsequent  backsubstitution  involving  the  Cholesky  factor. 
This  necessitates  the  additional  use  of  arrays  for  triangular  system  solution  and  intermediate  storage 
of  order  0(nl)  for  the  Cholesky  factor  during  forward  elimination  [12,  16]. 

Instead  of  performing  the  usual  forward  elimination,  recursions  similar  to  the  ‘Schur  recursions’ 
in  the  factorisation  may  can  be  employed  to  modify  the  right-hand  side  vector,  thus  making  it 
possible  to  employ  the  same  type  of  array  for  factorisation  and  forward  elimination.  Intermediate 
storage  of  the  Cholesky  factor  till  the  start  of  backsubstitution  may  be  avoided  by  re-generating  it 
on  the  fly  [3,  4]. 

A  final  improvement  in  efficiency  is  achieved  by  also  performing  backsubstitution  by  Schur 
recursions.  In  this  case  it  is  possible  to  perform  the  whole  solution  process  on  one  n-processor 
array  in  time  0(n)  [7,  8]. 

Since  it  appears  impossible  to  conceive  systolic  implementations  of  doubling  algorihms,  it  can 
be  concluded  that  the  most  efficient  method  hitherto  to  solve  Toeplitz  systems  on  systolic  arrays 
is  one  that  makes  maximum  use  of  Schur  recursions. 

Notation 

A  symmetric  Toeplitz  matrix  of  order  n  +  1  will  be  denoted  by  T„,  where 

/to  *i  .  tf»  \ 

h  to  i 


: 

\tn  .  h  t0J 

and  the  sequence  t,- . . .  t*  of  Toeplitz  matrix  elements  for  i  <  k  will  be  denoted  by  t,:*.  Frequent  use 
will  be  made  of  the  fact  that  the  first  and  last  columns  of  Tn  have  the  respective  representations 
to-n  and  Jto.n  where  J  =  ( en  ...  t\  eo )  represents  the  ‘exchange’  matrix  with  ones  on  the 
antidiagonal,  and  e,  is  a  (n  +  1)  x  1  vector  with  a  one  in  position  i  and  zeros  everywhere  else, 
0  <  i  <  n.  At  last,  0*  stands  for  the  k  x  1  vector  consisting  of  k  zero  elements;  when  k  =  0  it  is 
the  empty  vector. 


A  symmetric  Toeplitz  matrix  Tn  is  centro-symmetric,  that  is 


Tn  =  JTnJ. 


Direct  methods  with  operation  count  0(n2)  or  less  for  the  solution  of  dense  Toeplitz  systems  of 
order  n  exploit  this  property  and  the  resulting  recursive  nesting  of  Toeplitz  matrices: 


/ 

^  ) 

(  to 

tl  ...  tn  y 

Tn- 1 

h 

=  Tn  =  JTnJ  = 

tx 

Tn- 1 

.  tn  •  •  •  tl 

*0  ; 

x  tn 

/ 

The  system  Tnx  =  /  can  be  solved  by  recursively  solving  a  system  involving  Tn-i- 


The  Levinson  Algorithm 


Levinson’s  algorithm  computes  the  Cholesky  factorisation  of  the  inverse  of  a  n  x  n  Toeplitz 
matrices  with  0(n2)  operations;  the  following  derivation  is  partly  based  on  the  one  given  by  Trench 
in  [19].  Suppose  the  Cholesky  factor  Ln  of  T~x  =  L^D^lLn  is  known,  where  Ln  is  unit  lower 
triangular  and  Dn  is  diagonal  (as  Tn  is  symmetric  positive-definite  the  diagonal  elements  of  D„  are 
strictly  positive).  The  Toeplitz  matrix  Tn+ 1  of  next  higher  order  can  be  partitioned  into  a  2  x  2 
block  matrix  as 

(Tn  t\ 

Tn+l  —  I  T  ^  I.  t  =  1  =  (*n+l  •••  *1  )  • 


Solving 


Tn+lTn+l  — 


where  7n+i  is  the  identity  matrix  of  order  n  +  1,  results  in  a  2  x  2  block  partitioning  for  Tn*x 


T~1  — 
*n+l  ~ 


fr-x  +  T-htTT-l/d 


-tTT~l/d 


-T-H/d\ 
1/d  )' 


Setting  ip  =  ~T~1t  yields 

,  _  ( T~l  +  iptpT /d  +/d\_  (In  iP\(T~2  I„  \ 

n+1  {  iPT/d  l/d)  [l  ){  d-')  UT  1/ 


=  Lt 


n+l^n+l^n+l' 


(1) 
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Thus,  ( ipT  1 )  is  the  trailing  row  of  the  Cholesky  factor  Ln+ 1  of  T~+v  It  remains  to  show  that  ip 
and  d  can  be  computed  in  O(n)  steps. 

To  this  end,  suppose  that  the  Cholesky  factorisation  of  T~l  is  already  known: 

'do 

(  T__*,  +  iPniPL  /dn  Ipn/dn  \  I 

T1 


r-1  -  ltD~1L  -  (Tn~i  +  ^/dn  ^n/dn\  D 
r„  -  inDn  i.  -  ^  </(t  1/dB  j .  d. 


dn 


where  do  =  to  and  ipn  is  a  n  x  1  vector.  The  symmetry-centro  symmetry  of  Toeplitz  matrices 
implies  for  the  inverse  of  the  next  higher-order  matrix  Tn+i  that  T~h  =  JT~*XJ,  or  in  block  form 

n+lJ/dn+ 

+  d  ipn+lip 


(Tn  1  +  V'n+lV’n+l/^n+l  V’n+lM.+l  \  _  /  l/d«+l  V^+l^Mi+l  \ 

V  V'l+lMt+l  1/^n+l  )  yJ'J'n+l/dn+l  T~l  +  •JV’n+lV’»+l^M»+l  / 


(2) 


This  gives  for  the  trailing  row  ipn+i  =  ~Tn  1  Jti  n+i  of  Ln+j  the  block  form 

~l/dn  —  ipn  J / dn  \  /  *n+l  \  _  /  ~ (*n+l  +  *Pnt\\n)/dn  \ 

Jdn  ~T~}1  -  JiPntfd/dn  )  V  ^1*  /V  -  JiPn(tn+ 1  +  V£*l:n)M,  J 

Denoting  the  term  in  brackets  by 


(- 


Pn+1  =  -(tn+1  +V>^l:n)/d„  =  ~  [ipl  l)tl:„+l/d„,  Pi  =  -tl/t0) 


(3) 


and  observing  that  ipn  =  -Tn}lJtin  gives 

V'n+i  = 

\  V'n 


Pn+l 


.)■ 


I>n  +  Pn+l  J  Ipn  , 

Consequently,  with  V'o  the  empty  vector,  the  trailing  row  of  Ln+ 1  can  be  obtained  from  the 
trailing  row  of  Ln  via 


m-  * 


0 1 

(  1  ) 

V-n 

+  Pn+l 

J*Pn 

1  J 

0  i 

(4) 


Remembering  that  d~l  is  the  bottom  right  element  of  Tn  1  and  pn+i  is  the  leading  element  of  V’n+i 
one  gets  with  (2)  for  the  bottom  right  element  of  T~ +*j 


dn+l  =  dnl  +  Pn+l<*n+l  Or  dn+\  =  dn(  1  -  P*  +  l)>  d0  =  t0. 


(5) 


Note  that  the  original  paper  by  Levinson  [14]  does  not  contain  the  simple  recursive  computation 
of  dn+ 1  from  dn  and  pn+ 1 
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The  Levinson  Algorithm 

The  Levinson  algorithm  computes  the  lower  triangular  Cholesky  factor  Ln  of  T~l  with  Jbth  row 
given  by  (^*>0  ...  4>k,k-i  1  0 %_k). 


do  =  to 


1  <  jfc  <  n, 


k- 1 

Pk  =  -(**  +  ^2  'l>k-i,j-\tj)/dk-\,  dk  =  d*-i(l  -  pi),  ipk, o  ~  Pk 
i= i 


( V't.o 


•••  V't.t-l)  =  ( 1  Pk) 


(^k- 1,0 

' h-l,k-2 


V't-l.t-J 
V'fc-i,  0 


The  reason  why  Levinson’s  algorithm  has  little  potential  for  parallelisation  is  that  the  vector 
ipn  enters  into  the  computation  of  V’n+i  from  both  ends:  in  a  linear  combination  of  rpn  and  Jipn. 
Even  if  one  were  to  maintain  two  separate  copies,  ij>n  and  the  weight  pn+ i  in  this  combination 
would  still  depend  on  the  entire  vector  rl>n.  Thus,  it  is  not  possible  to  pipeline  successive  recursions, 
and  the  lower  bound  on  the  time  of  a  n  x  n  parallel  Cholesky  factorisation  of  is  O(nlogn). 

The  Schur  Algorithm 

Since  the  inner-product  (3)  in  the  formation  of  pn+1  is  the  culprit  for  the  poor  parallel  per¬ 
formance  of  Levinson’s  algorithm  one  could  try  to  reformulate  the  algorithm  so  as  to  obviate  the 
need  for  an  explicit  inner-product  computation.  This  is  accomplished  by  observing  that  (1)  implies 
dn  —  (V’n  1 )  Jto.n,  and  substituting  this  into  (3)  leads  to 

_  ( V'n  l)tl:n+l 

t firijJhZ' 

Thus,  the  coefficient  pn+i  is  the  ratio  of  two  quantities  that  are  obtained  by  multiplying  ( V’n  0 
and  its  reverse  by  a  column  of  the  Toeplitz  matrix.  This  is  the  basis  for  the  so-called  Schur  algorithm 
[1,  15,  18],  it  avoids  the  inner-product  by  recursively  ‘updating’  matrix-vector  products  involving 
the  Toeplitz  matrix,  so  that  pn+i  can  be  formed  as  the  ratio  of  two  vector  elements. 

Unlike  the  Levinson  algorithm  which  determines  the  Cholesky  factor  of  the  inverse,  the  Schur 
algorithm  determines  the  Cholesky  decomposition  of  the  matrix  proper.  If  the  result  of  Levinson’s 
method  is  T~l  =  L^D~lLn ,  where  Ln  is  lower  triangular,  then  the  uniqueness  of  the  Cholesky 


decomposition  implies  that  LnTn  =  D„L~T  must  be  a  scaled  version  of  the  upper  triangular 
Cholesky  factor  Un  of  Tn:  Tn  =  DnUn.  Therefore  it  follows  that  ((</)£  1)  )  Tn  is  the 

Acth  row  of  the  scaled  Cholesky  factor  DnUn  of  Tn.  Let  0^  represent  a  row  vector  of  k  zeros,  then 


the  Aith  row 

of  DnUn  has  the  form 

(Tk 

Ak  *  \ 

((tf 

1)  °L *)r.  =  ((tf  1)  <£-*-.  o) 

T„-lc-2  *  I 

\  * 

o 

» 

=  ((#  1)2*  (V-r  1  )Ak  *)  =  (0[  dk  (V-J  1  )Ak  *) 

=  (o  *  7*  t'i.O  •••  Vk,n-k- 1)»  (6) 

where  *  denotes  unimportant  terms  and  from  (1) 

(tf  1)7*  =  (0l  dk).  (7) 

If  it  is  possible  to  compute  the  non-zero  elements  (  dk  vk,o  •  •  ■  vk,n-k- 1 )  of  the  fcth  row  of 
DnUn  with  0(n  -  k)  operations  then  the  Cholesky  factorisation  Tn  =  U% DnUn  can  be  computed 
with  0(n2)  operations.  It  is  now  shown  how  to  compute  the  vector  of  uk  j  as  a  linear  combination 
of  two  vectors  by  making  use  of  the  Levinson  recursion  as  follows: 

(K+i  1)  <£_*_,  )r„  =  (o  (Vf  1)  or_*_1)T»  +  AM.1((i£  1)7  0  0l_k_,)Tn. 

With  (3),  (7)  and  (6)  the  first  summand  evaluates  to 

/  *£*+1  *  'l 

(0  (itf  1)  0^)hi:W  Tk  Ak 

\  *  Ak  T„-k-3  J 


=  ((^k  l)ti:k+i  (V't  1  )Tk  (V»*  1)  Ak)  =  (~pk+idk  0 1  dk  {xpl  1 )  Afc ) 


=  (~ 

Pfc+idfc  0*  dk 

X'k.O  •  ■  • 

Vk.n- 

-k-2 )  • 

The  second  summand  amounts  to 

f 

Tk  Jt 

l:k+l 

( (  Vf  1)7 

o  <£_*_,) 

tf.k+lJ 

to 

* 

V 

*1 

* 

Tn-k-1 ! 

=  ((^T 

1)7T* 

1  )*l:k+l 

l)JBk) 

=  (dk 

°*  A*k,0  •  •  • 

Pkxn-k- 1  )  ■ 

-Pk+idk  {$1  l)JBk) 
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Different  schedules  and  processor  assignments  for  individual  operations  can  be  derived  by  applying 
appropriate  linear  transformations  to  the  indices  of  the  computed  quantities:  the  pair  ( i/kj  /i*,y  ) 
is  computed  in  processor 


x  =  {k  j 


at  time 


r  =  ( fc  j 


"(3+'! 

°C)+rs- 


The  partial  order  of  computations  is  preserved  by  observing  that  vk-ij  must  be  computed  before 
i/*j,  and  Hk-ij+i  before  pkj-  That  is,  the  time  function  must  satisfy 

-T\  <  0,  —T\  +  T2  <  0. 

In  general,  if  a  quantity  with  index  (*,/)  depends  on  a  quantity  with  index  (k,l)  the  latter  must 
be  available  before  the  former  can  be  determined,  in  other  words  [8] 


( k-i  <-;)(  J  <0- 

To  make  sure  that  one  processor  does  not  have  to  perform  two  different  operations  at  the  same 


time  the  determinant  of  the  matrix 


(n  ") 

V  r2  X2) 


should  be  non-zero  [8],  t\x2  -  t2x\  ^  0. 

The  first  systolic  array  presented  in  [12]  is  based  on  the  following  linear  transformations.  In 
step  k  of  the  Schur  algorithm,  0  <  k  <  n,  ( ukj  fikj )  is  computed  in  processor  x  at  time  r  >  1 
where 

x  =  {k  j)  f  ^  +  0  =  J,  t  —  [k  j)  j  +  1  =  *:  +  1,  0<j  <n-k-l. 

The  parameters  pk  and  dk  are  assumed  to  be  associated  with  index  (k,  0)  and  thus  determined  in 
processor  x  =  0  at  time  r  =  k  +  1.  Initially  processor  j  is  loaded  with  matrix  element  ty. 

The  execution  of  the  Schur  algorithm  requires  n  +  1  steps.  In  each  step  processor  0  computes 
a  new  p  and  broadcasts  it  to  all  other  processors,  so  that  each  step  produces  a  new  row  of  the 
Cholesky  factor.  The  components  of  the  i'-vectors  remain  in  their  respective  processors  (i/kj  resides 
in  processor  j  for  all  k)  while  the  ^-vector  is  shifted  left  by  one  in  each  step  is  computed 


,J  'J  -*  «>.  -* LV.VuV.V V 


in  processor  j  +  1  before  being  sent  to  processor  j  for  the  computation  of  Note  that  in  step  k 
there  are  k  idle  processors,  and  in  order  to  offload  a  row  of  the  Cholesky  factor  each  processor  must 
be  able  to  perform  external  I/O. 

To  avoid  difficulties,  such  as  synchronisation  delays  and  long  wires,  associated  with  a  global 
communication  scheme  like  broadcasting  a  ‘pipelined’  array  is  proposed  in  [12,  16].  The  processor 
function  ( xi  xj  xj )  is  the  same  as  before  but  the  time  function  has  changed  to 

( n  Tt  rs )  =  ( 2  1  1 ) 

Thus,  pk  and  dk  are  computed  at  time  r  =  2fc  +  1  in  processor  0  and  pk  is  then  sent  to  processor  1. 
In  the  next  step,  at  r  =  2Jfc  +  2,  (/i*i  vk%\ )  can  be  computed  in  processor  1,  pk  can  be  transmitted 
to  processor  2  and  Hk,i  left  to  processor  1  so  that  at  r  =  2k  +  3  the  computation  of  pk+i  can  start. 
Thus,  successive  iterations  are  two  time  steps  apart.  Because  p„  and  d„  are  computed  at  r  =  2n  +  1 
the  computation  time  for  the  Schur  algorithm  comes  to  2(n  +  1).  The  replacement  of  broadcasting 
by  forwarding  (or  pipelining)  of  p  from  processor  to  processor  results  in  communication  that  takes 
place  exclusively  on  a  nearest  neighbour  basis.  All  other  features  are  the  same  as  in  the  first  array. 
In  order  to  solve  the  system  Tnx  =  f  three  possibilities  are  discussed  in  [12]: 

1.  forward  elimination  and  backsubstitution  involving  the  Cholesky  factors  Un,  Dn  and  of  Tn 

2.  computation  of  the  Levinson  vectors  rpk  using  the  p*  from  the  Schur  algorithm,  and  subsequent 
matrix  vector  multiplications  involving  Ln>  Dn,  and  Z,£  (the  \pk  constitute  the  rows  of  the 
Cholesky  factor  Ln  of  T~l) 

3.  explicit  computation  of  T~l  in  form  of  the  Gohberg-Semencul  formula  and  subsequent  matrix- 
vector  multiplications  by  means  of  FFTs  (the  Gohberg-Semencul  formula  represents  the  inverse 
of  a  Toeplitz  matrix  as  a  sum  of  products  of  triangular  Toeplitz  matrices  that  consist  of  the 
elements  of  tpn). 

Since  details  for  parallel  implementations  of  the  latter  two  methods  are  not  given  and  their 
data  and  control  flows  are  likely  to  be  rather  complex  only  the  first  method  will  be  considered. 
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Forward  Elimination  with  Cholesky  Factor 

Forward  elimination  solves  the  lower  triangular  system  ( DnUn)Th  =  /,  the  elements  of  the  solution 
vector  h  are  given  by  hk  =  /»**. 

ho,o  =  fo/do 
l  <  k  <  n,  hk.  i  =  0 

1  <  j  <  *-  1,  Vr  = 

V*  =  (/*  -  hk,k-\)/dk 


In  order  to  overlap  forward  elimination  as  much  as  possible  with  factorisation  a  second  linear 
array  with  n  +  1  processors  is  employed,  and  it  is  assumed  that  each  processor  in  the  elimination 
array  is  physically  connected  to  the  corresponding  processor  in  the  factorisation  array  Since  a 
matrix  element  j/;_i  is  computed  in  processor  k  -  j  at  time  r  =  k  +  j  -  2  in  the  factorisation 
array  it  may  be  used  at  the  next  time  step  in  processor  it  -  j  of  the  forward  elimination  array 
Hence,  the  forward  elimination  array  has  the  processor  function 


(jti  *2  x3)  =  (l  -1  0), 


and  essentially  the  same  time  function  as  the  factorisation  array  (the  time  functions  just  differ  by 
one  in  their  displacement  r3): 


( n  r2  r3  )  =  ( 2  1  2 ) 


Note  that  the  elements  of  the  h-vector  are  shifted  one  processor  to  the  right  each  step.  At  time 
r  =  Zk  +  2,  fk  and  dk  have  to  be  input  to  processor  0  of  the  elimination  array  so  that  hk  =  hk,k 
can  be  computed  there.  Thus,  forward  elimination  is  completed  after  the  computation  of  hn  at 
time  3n  +  3 . 


Backsubstitution  with  Choleaky  Factor 

Backsubstitution  determines  the  solution  x,  with  elements  xk  =  xk<k,  of  the  upper  triangular  system 
DnUnx  =  Dnh. 

%n,n  ~  dn^n/dn 
n  -  1  >  k  >  0,  xkin+1  =  0 

n  >  j  >  k  +  1,  xkj  =  xkj+ 1  +  t'kj-t-i Xjj 
*k,k  =  [dkhk  -  xk>k +i)/dk. 


As  backsubstitution  can  only  start  once  forward  elimination  is  completely  finished,  the  forward 
elimination  array  may  be  re-used.  Its  time  and  processor  functions  are  now 

(*i  *2  *s)  =  (-l  1  0),  (n  r2  rs)  =  (-l  -1  5n  +  3). 

If  processor  0  has  retained  all  hk  and  dk  from  the  forward  elimination  phase  then  the  solution 
element  xk  =  xk  k  can  be  determined  in  processor  0  at  time  r  =  5n  +  3  -  2k.  Note  that  solution 
of  a  Toeplitz  system  of  order  n  in  such  a  manner  requires  time  0(5n)  on  2 n  processors  plus  0{n7) 
storage  to  store  the  matrix  DnUn  during  the  forward  elimination  phase. 

Forward  Elimination  by  Schur  Recursions 

The  second  step,  the  modification  of  the  right-hand  side  vector  /  in  the  system  T„x  =  f  can 
be  improved  for  a  systolic  implementation  by  applying  the  same  operations  to  /  as  were  applied  to 
Tn  in  the  Schur  algorithm:  after  having  determined  LnT„  =  DnUn  one  now  determines  g  =  Lnf  so 
processors  perform  the  same  type  of  operations  during  facorisation  and  forward  elimination,  and 
only  one  type  of  array  is  needed  for  both  phases. 

To  derive  the  computational  steps  for  g  =  Lnf  we  extend  the  vector  /  to  a  Toeplitz  matrix  Fn 
with  /  as  its  first  column: 


/  = 

f  fo\ 

A 

,  Fn  = 

(  fo  fl  .  fn  > 

A  fo 

\fn) 

fo  A 

V/n  .  A  fo' 

11 


V  W  'J.  V’ji 


.*•  -  V- 


■ 


'•-I 


IVIV 


From  the  computation  of  LnFn  one  can  derive  recur* ion*  for  L*f  by  means  of  the  following  obser¬ 
vation.  The  fcth  element  of  g  —  Lnf  is 


1 


f4* 

A 

r 


£ 

s 


I 


1 


»*  =  ((^r  *)  °i-*)/  =  (^r  i)/o*.  o <*<n, 

while  the  fcth  row  of  Ln  F„  is 

((*1  i)  <£.*)*.  =  ((*?  i)  ^ 

=  (•  (v-r  t)f*) 

whose  Arth  element  is  the  trailing  element  of  (  1 )  F*  which  is  equal  to  (  V*  1  )  J/o  *  Hence, 

the  trailing  element  o{  (\l>l  1  )  J  Fk  in 

((W  1M  o;.k)F,  =  ((¥-f  1)JF*  •) 

^9k-(^I  l)/o* 

Consequently,  the  sought  vector  g  is  a  product  of  /  with  the  matrix  whose  rows  contain  the 
reverse  Levinson  vectors,  and  g  can  be  computed  by  the  following  Schur-like  recursions  involving 
the  upper  triangular  part  of  this  matrix 

Denote  elements  in  position  k  <  i  <  n  of 

(p  \ 

cj i)jFk  i)j€k) 


(<»*.» 

Qk.« 

n 

s. 

Sws 

«r 

)  Jck ) 

and  elements  in 

position  A  <  • 

<  n  of 

Dk  Jfnn 

((*>* 

»)  <£-*)*, 

ii 

»*-t 

•)  <£-e-i  0)  Dl 

Fn  k  J  * 

fin  J 

•  fa 

=  ((*i 

l)Fk  (I*  1  )Dk  (1,1 

1  )Jfnn  k) 

(0k, k  A,,)  =  ((^T  l)  J  fo  k  {*1  l)Dt  («f  \)Jfnn  k)  (8) 

Now  a i, j,  -  g*  is  the  fcth  element  of  g  and  the  recursive  computation  of  a*.i  j*.i  from 

as  ,  and  0kl  can  be  derived  by  means  of  the  Levinson  recursions  (4)  as  follows 

((v£i  »)  <>Zkl)Fn  (Ul  1)7  0*,  )•*.,(<)  (v[  1)  o;tl)Fw 
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Ignoring  elements  in  positions  0  through  k  on  both  sides  of  the  equation  gives 


(<»*+l,*+l  •••  <**+l,n)  =  ...  Qrfc.n)  +  Pk+l  {Pk, It  ■■■  0k,n-l) 

since  the  second  summand  is  equal  to 

/  /o  fi.k  *  \ 

(0  (tf  1)  0j_k_t)  I  fa,  Fk  Dh  =  ((tf  l)fi.k  (tf  1  )Fk  (V*  1)Z>*). 

V  *  D\  Fn-k.J 

Comparing  elements  in  positions  k  +  1  through  n  with  (8)  one  notes  that 
((4>l  l)Jf0k  (rf  l)Dk)  =  (0kk  ...0k,n-l)- 
The  second  vector  consisting  of  elements  0k+i,%,  4  +  I  <  »  <  n,  can  be  updated  similarly. 


Forward  Elimination  iy  Sckar  Recnraions 

The  Schur  recursions  determine  g  =  Lnf  where  gk  =  ak  k. 


( 


<*o,o 

00,0 


1  <  k  <  n, 


<*0,n  \  /  /o  •  •  •  /ii  \ 

0O,n  /  \  /o  fn) 

( <**,*  ak,n\  _  (  1  Pk\  (  <*k-l,k 

\0*,*  0k, n  J  \P*  1/  \d*- 1,4-1 


<*4-1, *» 
0k— l, n—  1 


Second  Class  of  Systolic  Implementations 

Again,  the  same  assumptions  as  before  hold,  and  the  array  from  [3, 4]  is  presented  that  performs 
both  phases,  factorisation  and  forward  elimination,  by  Schur  recursions. 

The  pipelined  version  of  the  factorisation  phase  is  performed  as  before.  As  for  forward  elimi¬ 
nation,  the  index  structure  of  its  equations  is  adapted  to  that  of  the  factorisation  by  performing  a 
linear  transformation  on  each  index  ( k  j). 

(*  j)(^  ~J=(4  >-*)’ 
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resulting  in 


afc-l,n-(k-l) 
0k-l,n- (*-!)+! 


If  a  second  (n+  l)-processor  array  is  available  for  forward  elimination  with  the  same  time  and 
processor  functions  as  those  of  the  factorisation  array,  then  factorisation  and  forward  elimination 
can  be  performed  simultaneously.  Processor  0  of  the  forward  elimination  array  is  assumed  to  be 
connected  to  processor  0  of  the  factorisation  array  so  the  latter  can  forward  the  p*  to  the  former. 

When  pipelining  is  used,  processor  0  of  the  elimination  array  receives  p*  and  forwards  it 
directly  to  the  other  processors  in  the  array  so  pairs  ( a^  j  0k  j  )  are  computed  in  processor  j  at 
time  2fc+i+l.  Initially,  processor  j  is  loaded  with  the  right-hand  element  fj.  Element  <7*  =  a*  0  is 
computed  in  processor  0  at  time  r  =  2k  + 1  and  transmitted  to  processor  0  of  the  factorisation  array 
where  it  is  retained  till  the  start  of  the  backsubstitution  phase.  Thus,  factorisation  and  forward 
elimination  can  be  executed  on  2(n  + 1)  processors  in  2(n+ 1)  time  steps  if  communication  proceeds 
on  a  nearest  neighbour  basis. 

In  order  to  avoid  the  0(n2)  storage  needed  to  store  D„Un  till  the  onset  of  backsubstitution 
only  its  last  column  and  the  parameters  p*  are  retained  from  which  DnUn  can  be  re-generated  by 
the  Schur  recursions. 


The  Reverse  Version  of  the  Schur  Algorithm 

The  reverse  version  of  Schur  algorithm  computes  the  scaled  Cholesky  factor  DnUn  of  Tn  with  fcth 
row  given  by  ( o£  d*  */*(o  . . .  I'k.n-k-i  )  from  p*,  1  <  k  <  n,  and  the  last  column 

{vo.n- 1  v\  ,n-2  •••  ‘'n-J.O  dn  )* 

of  DnUn,  whereby  n>,n-i  =  tn. 
n  -  1  >  k  >  0, 

Vk,0  •••  Vk,n-k- j\  _  1  (  “1 

Pk,  1  Pk,n-k-l J  Pk+l~1\Pk+l 

dk  =  dk+i/{l  -  pl+i),  Pkfl  =  -dkPk+1- 


'"■)  ( 


‘'k+l.O 

Mk+1,0 


vk+l,n-(k+l)~ 

Mk+l,n-(k+l) 


:) 


L 
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If  processor  0  in  the  factorisation  array  has  retained  all  pk  and  dn,  and  processor  n  -  k  has 
received  component  i>k,n-k-i  of  the  last  column  from  its  left  neighbour  then  the  re-generation  of 
DnU„  in  the  factorisation  array  can  start  at  time  2 (n  +  1).  The  processor  and  time  functions  are 

(*i  xs)  =  (0  1  0),  (n  r2  rs)  =  (-2  -1  4n  +  2), 

so  that  ( Vkj  Hkj+ 1 )  is  computed  in  processor  j  at  time  4n  +  2  —  2k  -  j,  and  d*  in  processor  0 
at  time  r  =  4n  +  2  -  2k.  Processor  0  stores  the  d*  for  the  backsubstitution  phase.  Note  that  the 
components  of  the  */-vector  stay  put  in  a  processor  (i/*j  resides  in  processor  j  for  all  k)  while  the 
components  of  the  p- vector  are  shifted  one  processor  to  the  right  in  each  step. 

Suppose  a  third  array  for  backsubstitution  is  available  whose  processors  are  connected  to  the 
corresponding  processors  of  the  factorisation  array.  Since  Ukj-k-i  is  computed  in  processor  j  —  k  —  1 
at  time  r  =  4n  +  3  —  k  —  j  it  can  be  used  at  time  4n  +  5  in  processor  j  —  k  of  the  backsubstitution 
array.  With  processor  and  time  functions 

(*i  *2  3rs)  =  (-l  1  0),  (n  r2  rs)  =  (-l  -1  4n  +  5), 

Xj  k  can  be  computed  in  processor  j  -  k  at  time  4n  +  5  -  k.  Since  processor  0  has  retained  d*  from 
the  previous  re-generation  phase  and  gk  is  computed  early  enough  in  processor  0  of  the  forward 
elimination  array  (at  time  r  =  2k  +  1),  element  X*  =  Xk,k  of  the  solution  vector  can  be  computed 
in  processor  0  at  time  4n  -I-  5  —  2k.  The  whole  computation  is  completed  in  4n  +  6  time  steps.  Note 
that  during  step  k  of  the  factorisation  and  forward  elimination  phase  there  are  2k  idle  processors. 

Therefore,  6n  processors  can  compute  the  solution  toanxn  Toeplitz  system  in  time  0(4n) 
only  relying  on  nearest  neighbour  computation,  however  the  storage  in  at  least  one  processor  must 
be  proportional  to  the  problem  size  O(n). 

Backsubst  itu  t  ion 

The  last  step  is  normally  solved  by  backsubstitution  x  =  L^D~xg  without  making  any  use  of 
the  Toeplitz  structure  of  the  original  system.  A  new  approach  that  uses  the  Schur  recursions  also 
for  the  last  step  was  derived  in  [7]  and  can  be  related  to  the  Levinson  recursions  as  follows. 

Remember  that  ( ( 1 )  ),  is  the  fcth  column  of  Lj,  that  gk  =  a*,*  is  the  fcth  element 

of  the  vector  g,  and  that  d*  is  the  fcth  diagonal  element  of  Dn,  0  <  k  <  n.  With  the  abbreviation 
7 k  —  gk/dk,  0  <  k  <  n,  one  can  express  the  solution  vector  as  a  linear  combination  of  the  columns 


Define  the  partial  sums 


V'n-t-l 

1 

Ofc+i 


x 


W  =  7» 


*("-*-!)  =  i(n"k)  +  7„-k-i 


,  o< 


so  that  =  x.  It  will  now  be  shown  by  induction  that  for  1  <  k  <  n 


/  0n_*  \ 

f  7n-t,n-k  ' 

(  °k~i+ 1  \ 

£ 

*("-*)  = 

*n-k,n-k 

'  ktn  ' 

+ 

Vn-k.n 

\  0n^k  J 

1=0 

V'n-k-l 

l  o>+1 

k 

+  ^  ^  7n-*,n-> 
>=0 


(i)  For  k  =  0  it  follows  from  the  Levinson  recursion  (4)  that 


where  „  =  7„  =  gn/dn  and  r)nn  =  pn-/n. 

(ii)  Assume  the  statement  is  true  for  Jt  >  0. 


(iii)  Using  the  induction  hypothesis  (ii)  for  x^n  k ^  in 


<  n  -  1, 


'  0;  +  l 


and  making  use  of  the  Levinson  recursion  in  each  of  the  two  sums  of  (9)  results  in 
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for  the  second  sum. 
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Collecting  corresponding  terms  gives  the  following  expression  for  xl 
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which  can  be  written  u 


f  o„_ *_ i  > 

*n  —  k  —  l,n  — fc-  1 

+ 
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This  completes  the  induction. 


The  backsubstitution  part  using  the  Schur  recursions  computes  the  e-  and  ^-vectors  and  can 
be  formulated  as  follows. 


Backaubatitution  by  Schur  Recuraiona 

The  Schur  recursions  determine  the  vector  z  =  L*D~lg  with  its  fcth  element  given  by  zk. 


(. 


n,n 


k  *?n,n  / 

1  <  A:  < 


9n/dn\ 

,  0  J 


■(, 


(*n-k,n-k  (n-k,n 
Vn-k,n-k  Vn-k,n 


-k  + 1 
n  -  k  + 1 


k,n-l  fn-k,r»\ 
k,r»-l  *?n-k,n  / 


^fl— 

Vn-k.n- 


1 


Pn-k\ 

/  0r»- 

k/ dn-k 

tf»-(k-I),n-(k-  1) 

(k- l),n- 1  fn-(k-l),n\ 

*  1  / 

Un-(k- 

-l).n-(t-l) 

*?n-(k- l),n-(k-l)+l 

Vn-[k-  l),n  0  / 

o  <  j  <  n,  Zy  =  fo,>  +  *7o,y  • 


Third  Class  of  Systolic  Implementations 

The  computation  of  all  phases,  factorisation,  forward  elimination  and  backsubstitution,  by 
Schur  recursions  makes  it  possible  to  employ  only  one  array  for  all  three  phases.  The  corresponding 
array  in  [7,  8]  is  the  most  efficient  of  the  three  types  of  designs  presented,  and  can  be  derived  as 
follows  (the  processor  and  time  functions  here  differ  from  the  ones  in  [7,  8]  in  a  few  small  details 
that  do  not  affect  the  asymptotic  computation  time). 
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To  fit  all  three  phases  on  one  array  it  is  convenient  to  adapt  the  index  structure  of  the  factori¬ 
sation  phase  to  that  of  forward  elimination  by  transforming  each  index  (k  j)  in  the  factorisation 


<*  j,(o  :)= 


=  ( *  *  +  i )  • 


The  transformed  factorisation  phase  is  thus  expressed  as: 


do  =  to 

(^0,0  •••  k'O.n-l  \  _  /  *1  •••  *n\ 

P0,0  •••  M0,n-1  /  V*!  •••  **/ 

1  <  k  <  n,  pk  =  -/xt-i.t-i/dt-i,  die  =  d*_j(l  -  p\) 


(Vk,k  ■■■  Vk,n-1  \  _  /  1  Pk\  ( ‘'k-l.k-l 

Pk,k  •••  Mk,n-1  /  \Pfc  1  /  \  Mk-l,fc 


The  time  and  processor  functions  are  chosen  to  be 


t'k-l,n-2 
Mk— l,n— 1 


(*i  X 2  *j)  =  (l  o  0),  (rx  r2  r3)  =  (l  1  2). 

Thus,  all  matrix  elements  are  input  to  the  same  processor:  tk  is  input  to  processor  0  at  time 
r  =  k  +  1;  and  ( i/*j  Pkj )  determined  in  processor  k  at  time  r  =  k  +  j  -f  2.  The  values  of  pk 
and  dk  are  computed  along  with  t /*,*,  in  processor  k  at  time  2k  +  2,  and  remain  in  that  processor 
throughout  factorisation  and  forward  elimination.  Notice  that  the  components  of  the  {/-vector  stay 
put  in  the  processor  while  the  components  of  the  /{-vector  are  shifted  one  processor  to  the  left.  The 
last  quantities  pn  and  dn  are  computed  in  processor  n  at  time  2(n  +  1),  so  the  computation  of  the 
factorisation  requires  2n  +  3  steps. 

Since  the  factorisation  has  the  same  structure  as  the  forward  elimination  phase,  and  the  forward 
elimination  phase  involves  the  pk  which  are  now  computed  in  different  processors  the  two  phases 
may  be  overlapped,  thereby  eliminating  the  processor  idle  time  of  the  previous  designs.  Observe 
that  the  last  matrix  element  tn  is  input  to  processor  0  at  r  =  n  +  2  so  the  first  element  of  the 
right-hand  side  vector  /o  can  be  input  to  processor  0  at  time  n  +  3.  In  general,  all  right-hand 
side  elements  are  input  to  the  same  processor  as  the  matrix  elements:  /,  is  input  to  processor  0 
at  time  n  -I-  j  +  3,  and  time  and  processor  functions  (except  for  the  time  displacement  r3)  are  the 
same  as  before: 

(  xx  x2  x3 )  =  ( 1  0  0 ) ,  (  rx  r2  r3 )  =  ( 1  1  n  +  3  ) . 


Rl 


The  pair  ( a^y  /9*j  )  is  determined  in  processor  k  at  time  r  =  k+j + n  +  3,  and  the  components  of 
both  a~  and  vectors  experience  a  shift  to  the  left  neighbouring  processor  after  their  computation. 
Element  <7*  =  a*,*  is  computed  in  processor  k  at  time  2k  +  n  +  3,  and  forward  elimination  is 
completed  at  time  3 n  +  4. 

To  keep  communication  on  a  nearest  neighbour  basis,  the  linear  array  is  folded  together  so 
that  processors  k  and  n  -  k  are  situated  across  from  each  other.  After  completion  of  the  forward 
elimination  phase  processors  k  and  n  -  k  can  then  exchange  their  values  of  p,  d  and  g  so  that 
processor  k  ends  up  with  pn-k>  dn_k  and  gn-k-  For  simplicity  each  index  ( k  jf )  of  backsubstitution 
is  transformed  to 

/-I  0\ 

=  (n-  k  j), 


( k  j)  ^  0  l  j  +(»  °)  =  ( n 


resulting  in 


( 


tO  ,n 
»70  ,n. 


'gn/dn 

0 


1  <  *  <  n, 


(ek,n-k  ck,n-k+ 1  •••  ek,n- 1  ek,n  \ 

*7k,n— k  *7t,n-t+l  •••  flk.n—  1  Vk,n  J 

Pn-t\  /  gn-k/dn-k  fk-l,n-(t-l)  •••  ek-l,n-l 

Pn-k  l  /  \  •••  Vk-l,n  0  J 


0  <3<n,  Xj  =  €„,y  +  r)nJ. 
With  processor  and  time  functions 


(*l  *2  )  =  ( 1  0  0),  (n  r2  r3)  =  (  2  1  2n  +  4) 

the  pair  ( fk,y  ^k,>  )  is  computed  on  processor  k  at  time  r  =  2fc+y+2n+4.  In  particular,  component 
Xk  =  c n,k  +  *7n,k  of  the  solution  vector  is  computed  in  processor  n  at  time  r  =  4(n  +  1)  +  k.  Hence 
backsubstitution  is  completed  at  time  5n  +  6. 

With  the  above  scheme,  a  Toeplitz  system  of  order  n  can  be  solved  in  time  0(5n)  on  n 
processors  with  nearest  neighbour  commuincation.  Each  processor  requires  only  a  constant  amount 
of  storage.  External  input  takes  place  on  the  first  processor  and  external  output  on  the  last.  As 
shown  in  [7,  8]  the  solution  processes  for  several  different  problems  with  different  right-hand  sides 
can  be  overlapped  and  the  solution  to  a  new  problem  can  be  obtained  every  n  steps.  Furthermore, 
as  shown  in  [8],  the  above  array  belongs  to  the  class  of  n-processor  arrays  that  solve  Toeplitz 
systems  faster  than  any  other  array  with  linear  processor  and  time  function,  and  I/O  restricted  to 
the  boundary  processors. 
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